The state of the art in data compression is arithmetic coding, not the better-known
Huffman method. Arithmetic coding gives greater compression, is
faster for adaptive models, and clearly separates the model from the channel 
encoding. 
Arithmetic coding is superior in most respects to the
better-known Huffman [10] method. It represents in- 
formation at least as compactly — sometimes consid- 
erably more so. Its performance is optimal without 
the need for blocking of input data. It encourages a 
clear separation between the model for representing 
data and the encoding of information with respect to 
that model. It accommodates adaptive models easily 
and is computationally efficient. Yet many authors 
and practitioners seem unaware of the technique. 
Indeed there is a widespread belief that Huffman 
coding cannot be improved upon. 

We aim to rectify this situation by presenting an 
accessible implementation of arithmetic coding and 
by detailing its performance characteristics. We start 
by briefly reviewing basic concepts of data compres- 
sion and introducing the model-based approach that 
underlies most modern techniques. We then outline 
the idea of arithmetic coding using a simple exam- 
ple, before presenting programs for both encoding 
and decoding. In these programs the model occupies 
a separate module so that different models can easily 
be used. Next we discuss the construction of fixed 
and adaptive models and detail the compression 
efficiency and execution time of the programs, 
including the effect of different arithmetic word 
lengths on compression efficiency. Finally, we out- 
line a few applications where arithmetic coding is 
appropriate. 
DATA COMPRESSION

To many, data compression conjures up an assort- 
ment of ad hoc techniques such as conversion of 
spaces in text to tabs, creation of special codes for 
common words, or run-length coding of picture data 
(e.g., see [8]). This contrasts with the more modern 
model-based paradigm for coding, where, from an 
input string of symbols and a model, an encoded string 
is produced that is (usually) a compressed version of 
the input. The decoder, which must have access to 
the same model, regenerates the exact input string 
from the encoded string. Input symbols are drawn 
from some well-defined set such as the ASCII or 
binary alphabets; the encoded string is a plain se- 
quence of bits. The model is a way of calculating, in 
any given context, the distribution of probabilities 
for the next input symbol. It must be possible for the 
decoder to produce exactly the same probability dis- 
tribution in the same context. Compression is 
achieved by transmitting the more probable symbols 
in fewer bits than the less probable ones. 

For example, the model may assign a predeter- 
mined probability to each symbol in the ASCII 
alphabet. No context is involved. These probabilities 
can be determined by counting frequencies in repre- 
sentative samples of text to be transmitted. Such a 
fixed model is communicated in advance to both en- 
coder and decoder, after which it is used for many 
messages. 

Alternatively, the probabilities that an adaptive 
model assigns may change as each symbol is trans- 
mitted, based on the symbol frequencies seen so far 
in the message. There is no need for a representative 
sample of text, because each message is treated as
an independent unit, starting from scratch. The encoder's
model changes with each symbol transmitted,
and the decoder's changes with each symbol
received, in sympathy.

More complex models can provide more accurate 
probabilistic predictions and hence achieve greater 
compression. For example, several characters of pre- 
vious context could condition the next-symbol prob- 
ability. Such methods have enabled mixed-case Eng- 
lish text to be encoded in around 2.2 bits/character 
with two quite different kinds of model [4, 6]. Techniques
that do not separate modeling from coding
so distinctly, like that of Ziv and Lempel [23], do 
not seem to show such great potential for compres- 
sion, although they may be appropriate when the 
aim is raw speed rather than compression per- 
formance [22]. 

The effectiveness of any model can be measured 
by the entropy of the message with respect to it, 
usually expressed in bits/symbol. Shannon's funda- 
mental theorem of coding states that, given messages 
randomly generated from a model, it is impossible to 
encode them into fewer bits (on average) than the entropy
of that model [21].

A message can be coded with respect to a model 
using either Huffman or arithmetic coding. The for- 
mer method is frequently advocated as the best pos- 
sible technique for reducing the encoded data rate. 
It is not. Given that each symbol in the alphabet 
must translate into an integral number of bits in the 
encoding, Huffman coding indeed achieves "minimum
redundancy." In other words, it performs optimally
if all symbol probabilities are integral powers
of 1/2. But this is not normally the case in practice;
indeed, Huffman coding can take up to one extra bit 
per symbol. The worst case is realized by a source 
in which one symbol has probability approaching 
unity. Symbols emanating from such a source con- 
vey negligible information on average, but require at 
least one bit to transmit [7]. Arithmetic coding dis- 
penses with the restriction that each symbol must 
translate into an integral number of bits, thereby 
coding more efficiently. It actually achieves the the- 
oretical entropy bound to compression efficiency for 
any source, including the one just mentioned. 
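
To make the worst case concrete, consider a two-symbol source whose more likely symbol has probability 0.99. The figures 0.99 and 0.01 are assumed here purely for illustration and do not come from the experiments reported in this article. The entropy of such a source is

```latex
\begin{align*}
H &= -0.99\log_2 0.99 - 0.01\log_2 0.01 \\
  &\approx 0.0144 + 0.0664 \\
  &\approx 0.081 \text{ bits/symbol}
\end{align*}
```

A Huffman code must still spend one whole bit on every symbol, roughly twelve times the entropy, whereas a code that is not tied to an integral number of bits per symbol can approach 0.081 bits/symbol.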

In general, sophisticated models expose the defi- 
ciencies of Huffman coding more starkly than simple 
ones. This is because they more often predict sym- 
bols with probabilities close to one, the worst case 
for Huffman coding. For example, the techniques 
mentioned above that code English text in 2.2 bits/ 
character both use arithmetic coding as the final 
step, and performance would be impacted severely
if Huffman coding were substituted. Nevertheless,
since our topic is coding and not modeling, the illus- 
trations in this article all employ simple models. 
Even so, as we shall see, Huffman coding is inferior 
to arithmetic coding. 

The basic concept of arithmetic coding can be 
traced back to Elias in the early 1960s (see [1, 
pp. 61-62]). Practical techniques were first intro- 
duced by Rissanen [16] and Pasco [15], and de- 
veloped further by Rissanen [17]. Details of the 
implementation presented here have not appeared 
in the literature before; Rubin [20] is closest to our 
approach. The reader interested in the broader class 
of arithmetic codes is referred to [18]; a tutorial is 
available in [13]. Despite these publications, the 
method is not widely known. A number of recent 
books and papers on data compression mention it 
only in passing, or not at all. 

THE IDEA OF ARITHMETIC CODING 

In arithmetic coding, a message is represented by an 
interval of real numbers between 0 and 1. As the 
message becomes longer, the interval needed to represent
it becomes smaller, and the number of bits
needed to specify that interval grows. Successive 
symbols of the message reduce the size of the inter- 
val in accordance with the symbol probabilities gen- 
erated by the model. The more likely symbols re- 
duce the range by less than the unlikely symbols 
and hence add fewer bits to the message. 

Before anything is transmitted, the range for the 
message is the entire interval [0, 1), denoting the 
half-open interval 0 ≤ x < 1. As each symbol is
processed, the range is narrowed to that portion of it 
allocated to the symbol. For example, suppose the 
alphabet is {a, e, i, o, u, !}, and a fixed model is used
with probabilities shown in Table I.

TABLE I. Example Fixed Model for Alphabet {a, e, i, o, u, !}

Symbol    Probability    Range
a         .2             [0, 0.2)
e         .3             [0.2, 0.5)
i         .1             [0.5, 0.6)
o         .2             [0.6, 0.8)
u         .1             [0.8, 0.9)
!         .1             [0.9, 1.0)

Imagine transmitting the message eaii!. Initially, both encoder
and decoder know that the range is [0, 1). After 
seeing the first symbol, e, the encoder narrows it to 
[0.2, 0.5), the range the model allocates to this sym- 
bol. The second symbol, a, will narrow this new 
range to the first one-fifth of it, since a has been 
allocated [0, 0.2). This produces [0.2, 0.26), since the
previous range was 0.3 units long and one-fifth of
that is 0.06. The next symbol, i, is allocated [0.5, 0.6),
which when applied to [0.2, 0.26) gives the smaller 
range [0.23, 0.236). Proceeding in this way, the en- 
coded message builds up as follows: 



Initially                 [0,       1)
After seeing    e         [0.2,     0.5)
                a         [0.2,     0.26)
                i         [0.23,    0.236)
                i         [0.233,   0.2336)
                !         [0.23354, 0.2336)
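
The narrowing steps listed above can be reproduced with a few lines of C. This is only a sketch using double-precision reals and the model of Table I; the names Interval, MODEL_SYMS, MODEL_LOW, MODEL_HIGH, symbol_index, and encode_message are introduced here for illustration and are not part of the programs presented later, which use integer arithmetic instead.

```c
/* Half-open interval [low, high) of real numbers. */
typedef struct { double low, high; } Interval;

/* Model of Table I: symbol k owns the range [MODEL_LOW[k], MODEL_HIGH[k]). */
static const char   MODEL_SYMS[] = "aeiou!";
static const double MODEL_LOW[]  = {0.0, 0.2, 0.5, 0.6, 0.8, 0.9};
static const double MODEL_HIGH[] = {0.2, 0.5, 0.6, 0.8, 0.9, 1.0};

/* Return the position of c in the alphabet, or -1 if it is not a symbol. */
static int symbol_index(char c)
{
    int k;
    for (k = 0; MODEL_SYMS[k] != '\0'; k++)
        if (MODEL_SYMS[k] == c) return k;
    return -1;
}

/* Start from [0, 1) and narrow the interval once per symbol of msg.
   Characters outside the alphabet are skipped (an assumption of this sketch). */
Interval encode_message(const char *msg)
{
    Interval r = {0.0, 1.0};
    for (; *msg != '\0'; msg++) {
        int k = symbol_index(*msg);
        double range;
        if (k < 0) continue;
        range  = r.high - r.low;
        r.high = r.low + range * MODEL_HIGH[k];  /* uses the old r.low */
        r.low  = r.low + range * MODEL_LOW[k];
    }
    return r;
}
```

Calling encode_message("eaii!") yields an interval that agrees with the last row above up to floating-point rounding. Rounding is exactly why a practical coder cannot rely on real arithmetic for long messages.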

Figure 1 shows another representation of the en- 
coding process. The vertical bars with ticks repre- 
sent the symbol probabilities stipulated by the 
model. After the first symbol has been processed, the 
model is scaled into the range [0.2, 0.5), as shown in 



FIGURE 1a. Representation of the Arithmetic Coding Process



Figure 1a. The second symbol scales it again into the
range [0.2, 0.26). But the picture cannot be continued
in this way without a magnifying glass! Consequently,
Figure 1b shows the ranges expanded to
full height at every stage and marked with a scale 
that gives the endpoints as numbers. 

Suppose all the decoder knows about the message 
is the final range, [0.23354, 0.2336). It can immedi- 
ately deduce that the first character was e, since the 
range lies entirely within the space the model of 
Table I allocates for e. Now it can simulate the oper- 
ation of the encoder: 



Initially 
After seeing e 



[0, 1) 
[0.2, 0.5) 



This makes it clear that the second character is a,
since this will produce the range 

After seeing a [0.2, 0.26), 

which entirely encloses the given range [0.23354, 
0.2336). Proceeding like this, the decoder can iden- 
tify the whole message. 

It is not really necessary for the decoder to know 
both ends of the range produced by the encoder. 
Instead, a single number within the range — for ex- 
ample, 0.23355 — will suffice. (Other numbers, like 
0.23354, 0.23357, or even 0.23354321, would do just 
as well.) However, the decoder will face the problem 
of detecting the end of the message, to determine 
when to stop decoding. After all, the single number
0.0 could represent any of a, aa, aaa, aaaa, .... To
resolve the ambiguity, we ensure that each message
ends with a special EOF symbol known to both encoder
and decoder. For the alphabet of Table I, ! will
be used to terminate messages, and only to termi- 



FIGURE 1b. Representation of the Arithmetic Coding
Process with the Interval Scaled Up at Each Stage
/* ARITHMETIC ENCODING ALGORITHM. */

/* Call encode_symbol repeatedly for each symbol in the message.           */
/* Ensure that a distinguished "terminator" symbol is encoded last, then   */
/* transmit any value in the range [low, high).                            */

encode_symbol(symbol, cum_freq)
    range = high - low
    high  = low + range*cum_freq[symbol-1]
    low   = low + range*cum_freq[symbol]


/* ARITHMETIC DECODING ALGORITHM. */

/* "Value" is the number that has been received.                           */
/* Continue calling decode_symbol until the terminator symbol is returned. */

decode_symbol(cum_freq)
    find symbol such that
        cum_freq[symbol] <= (value-low)/(high-low) < cum_freq[symbol-1]
                            /* This ensures that value lies within the new  */
                            /* [low, high) range that will be calculated by */
                            /* the following lines of code.                 */
    range = high - low
    high  = low + range*cum_freq[symbol-1]
    low   = low + range*cum_freq[symbol]
    return symbol

FIGURE 2. Pseudocode for the Encoding and Decoding Procedures



nate messages. When the decoder sees this symbol, 
it stops decoding. 
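
The decoding procedure just described, working from a single received number and stopping at the terminator, can be sketched as follows. As before this is an illustration in double-precision reals under the model of Table I; SYMS, LOW, HIGH, and decode_value are names assumed for this sketch only, and the caller must supply a buffer of at least one character.

```c
#include <stddef.h>

/* Model of Table I: symbol k owns the range [LOW[k], HIGH[k]). */
static const char   SYMS[] = "aeiou!";
static const double LOW[]  = {0.0, 0.2, 0.5, 0.6, 0.8, 0.9};
static const double HIGH[] = {0.2, 0.5, 0.6, 0.8, 0.9, 1.0};

/* Decode value into out (capacity cap, including the terminating NUL).
   Stops after emitting '!' or when the buffer is full.
   Returns the number of symbols written. */
size_t decode_value(double value, char *out, size_t cap)
{
    double low = 0.0, high = 1.0;
    size_t n = 0;
    while (n + 1 < cap) {
        double range  = high - low;
        double scaled = (value - low) / range;  /* position within [0, 1) */
        int k = 0;
        while (k < 5 && scaled >= HIGH[k]) k++; /* find the owning symbol */
        high = low + range * HIGH[k];           /* mimic the encoder      */
        low  = low + range * LOW[k];
        out[n++] = SYMS[k];
        if (SYMS[k] == '!') break;              /* terminator seen        */
    }
    out[n] = '\0';
    return n;
}
```

With value 0.23355 the sketch recovers eaii! and stops. With value 0.0 it produces a's until the buffer runs out, which is the ambiguity that the terminator symbol resolves.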

Relative to the fixed model of Table I, the entropy 
of the five-symbol message eaii! is 

$$-\log 0.3 - \log 0.2 - \log 0.1 - \log 0.1 - \log 0.1 = -\log 0.00006 \approx 4.22$$

(using base 10, since the above encoding was per- 
formed in decimal). This explains why it takes five 
decimal digits to encode the message. In fact, the 
size of the final range is 0.2336 - 0.23354 = 0.00006, 
and the entropy is the negative logarithm of this 
figure. Of course, we normally work in binary, 
transmitting binary digits and measuring entropy 
in bits. 

Five decimal digits seems a lot to encode a mes- 
sage comprising four vowels! It is perhaps unfortu- 
nate that our example ended up by expanding 
rather than compressing. Needless to say, however, 
different models will give different entropies. The 
best single-character model of the message eaii! is 
the set of symbol frequencies {e(0.2), a(0.2), i(0.4),
!(0.2)}, which gives an entropy of 2.89 decimal digits.
Using this model the encoding would be only three 
digits long. Moreover, as noted earlier, more sophis- 
ticated models give much better performance 
in general. 
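
The figure of 2.89 decimal digits can be checked in the same way as the earlier entropy calculation, taking one term per symbol of eaii! under the single-character model with e(0.2), a(0.2), i(0.4), and !(0.2):

```latex
\begin{align*}
-\log_{10} 0.2 - \log_{10} 0.2 - \log_{10} 0.4 - \log_{10} 0.4 - \log_{10} 0.2
  &= -\log_{10}\left(0.2^{3} \times 0.4^{2}\right) \\
  &= -\log_{10} 0.00128 \\
  &\approx 2.89
\end{align*}
```

The product 0.00128 is also the width of the final interval that this model would produce, which is why three decimal digits suffice to pick a number inside it.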



A PROGRAM FOR ARITHMETIC CODING 

Figure 2 shows a pseudocode fragment that summarizes
the encoding and decoding procedures developed
in the last section. Symbols are numbered 1, 2,
3, .... The frequency range for the ith symbol is
from cum_freq[i] to cum_freq[i - 1]. As i decreases,
cum_freq[i] increases, and cum_freq[0] = 1. (The
reason for this "backwards" convention is that
cum_freq[0] will later contain a normalizing factor,
and it will be convenient to have it begin the array.)
The "current interval" is [low, high), and for both 
encoding and decoding, this should be initialized 
to [0, 1). 

Unfortunately, Figure 2 is overly simplistic. In 
practice, there are several factors that complicate 
both encoding and decoding: 

Incremental transmission and reception. The encode 
algorithm as described does not transmit anything 
until the entire message has been encoded; neither 
does the decode algorithm begin decoding until it 
has received the complete transmission. In most 
applications an incremental mode of operation is 
necessary. 

The desire to use integer arithmetic. The precision 
required to represent the [low, high) interval grows 
with the length of the message. Incremental opera- 
tion will help overcome this, but the potential for 
overflow and underflow must still be examined
carefully. 

Representing the model so that it can be consulted 
efficiently. The representation used for the model 
should minimize the time required for the decode 
algorithm to identify the next symbol. Moreover, 
an adaptive model should be organized to minimize 
the time-consuming task of maintaining cumulative 
frequencies. 



Figure 3 shows working code, in C, for arithmetic 
encoding and decoding. It is considerably more de- 
tailed than the bare-bones sketch of Figure 2! Imple- 
mentations of two different models are given in 
Figure 4; the Figure 3 code can use either one. 

The remainder of this section examines the code 
of Figure 3 more closely, and includes a proof that 
decoding is still correct in the integer implementa- 
tion and a review of constraints on word lengths in 
the program. 



arithmetic_coding.h

  1  /* DECLARATIONS USED FOR ARITHMETIC ENCODING AND DECODING */
  2
  3
  4  /* SIZE OF ARITHMETIC CODE VALUES. */
  5
  6  #define Code_value_bits 16              /* Number of bits in a code value   */
  7  typedef long code_value;                /* Type of an arithmetic code value */
  8
  9  #define Top_value (((long)1<<Code_value_bits)-1)      /* Largest code value */
 10
 11
 12  /* HALF AND QUARTER POINTS IN THE CODE VALUE RANGE. */
 13
 14  #define First_qtr (Top_value/4+1)       /* Point after first quarter        */
 15  #define Half      (2*First_qtr)         /* Point after first half           */
 16  #define Third_qtr (3*First_qtr)         /* Point after third quarter        */


model.h

 17  /* INTERFACE TO THE MODEL. */
 18
 19
 20  /* THE SET OF SYMBOLS THAT MAY BE ENCODED. */
 21
 22  #define No_of_chars 256                 /* Number of character symbols      */
 23  #define EOF_symbol (No_of_chars+1)      /* Index of EOF symbol              */
 24
 25  #define No_of_symbols (No_of_chars+1)   /* Total number of symbols          */
 26
 27
 28  /* TRANSLATION TABLES BETWEEN CHARACTERS AND SYMBOL INDEXES. */
 29
 30  int char_to_index[No_of_chars];         /* To index from character          */
 31  unsigned char index_to_char[No_of_symbols+1]; /* To character from index    */
 32
 33
 34  /* CUMULATIVE FREQUENCY TABLE. */
 35
 36  #define Max_frequency 16383             /* Maximum allowed frequency count  */
 37                                          /*   2^14 - 1                       */
 38  int cum_freq[No_of_symbols+1];          /* Cumulative symbol frequencies    */

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding
encode.c

 39  /* MAIN PROGRAM FOR ENCODING. */
 40
 41  #include <stdio.h>
 42  #include "model.h"
 43
 44  main()
 45  {   start_model();                          /* Set up other modules.    */
 46      start_outputing_bits();
 47      start_encoding();
 48      for (;;) {                              /* Loop through characters. */
 49          int ch; int symbol;
 50          ch = getc(stdin);                   /* Read the next character. */
 51          if (ch==EOF) break;                 /* Exit loop on end-of-file.*/
 52          symbol = char_to_index[ch];         /* Translate to an index.   */
 53          encode_symbol(symbol,cum_freq);     /* Encode that symbol.      */
 54          update_model(symbol);               /* Update the model.        */
 55      }
 56      encode_symbol(EOF_symbol,cum_freq);     /* Encode the EOF symbol.   */
 57      done_encoding();                        /* Send the last few bits.  */
 58      done_outputing_bits();
 59      exit(0);
 60  }


arithmetic_encode.c

 61  /* ARITHMETIC ENCODING ALGORITHM. */
 62
 63  #include "arithmetic_coding.h"
 64
 65  static void bit_plus_follow();      /* Routine that follows                    */
 66
 67
 68  /* CURRENT STATE OF THE ENCODING. */
 69
 70  static code_value low, high;        /* Ends of the current code region         */
 71  static long bits_to_follow;         /* Number of opposite bits to output after */
 72                                      /* the next bit.                           */
 73
 74
 75  /* START ENCODING A STREAM OF SYMBOLS. */
 76
 77  start_encoding()
 78  {   low = 0;                                /* Full code range.         */
 79      high = Top_value;
 80      bits_to_follow = 0;                     /* No bits to follow next.  */
 81  }
 82
 83
 84  /* ENCODE A SYMBOL. */
 85
 86  encode_symbol(symbol,cum_freq)
 87      int symbol;                     /* Symbol to encode                        */
 88      int cum_freq[];                 /* Cumulative symbol frequencies           */
 89  {   long range;                     /* Size of the current code region         */
 90      range = (long)(high-low)+1;
 91      high = low +                                /* Narrow the code region   */
 92        (range*cum_freq[symbol-1])/cum_freq[0]-1; /* to that allotted to this */
 93      low = low +                                 /* symbol.                  */
 94        (range*cum_freq[symbol])/cum_freq[0];
 95      for (;;) {                                  /* Loop to output bits.     */
 96          if (high<Half) {
 97              bit_plus_follow(0);                 /* Output 0 if in low half. */
 98          }
 99          else if (low>=Half) {                   /* Output 1 if in high half.*/
100              bit_plus_follow(1);
101              low -= Half;
102              high -= Half;                       /* Subtract offset to top.  */
103          }
104          else if (low>=First_qtr                 /* Output an opposite bit   */
105                && high<Third_qtr) {              /* later if in middle half. */
106              bits_to_follow += 1;
107              low -= First_qtr;                   /* Subtract offset to middle*/
108              high -= First_qtr;
109          }
110          else break;                             /* Otherwise exit loop.     */
111          low = 2*low;
112          high = 2*high+1;                        /* Scale up code range.     */
113      }
114  }
115
116
117  /* FINISH ENCODING THE STREAM. */
118
119  done_encoding()
120  {   bits_to_follow += 1;                        /* Output two bits that     */
121      if (low<First_qtr) bit_plus_follow(0);      /* select the quarter that  */
122      else bit_plus_follow(1);                    /* the current code range   */
123  }                                               /* contains.                */
124
125
126  /* OUTPUT BITS PLUS FOLLOWING OPPOSITE BITS. */
127
128  static void bit_plus_follow(bit)
129      int bit;
130  {   output_bit(bit);                            /* Output the bit.          */
131      while (bits_to_follow>0) {
132          output_bit(!bit);                       /* Output bits_to_follow    */
133          bits_to_follow -= 1;                    /* opposite bits. Set       */
134      }                                           /* bits_to_follow to zero.  */
135  }


decode.c

136  /* MAIN PROGRAM FOR DECODING. */
137
138  #include <stdio.h>
139  #include "model.h"
140
141  main()
142  {   start_model();                              /* Set up other modules.    */
143      start_inputing_bits();
144      start_decoding();
145      for (;;) {                                  /* Loop through characters. */
146          int ch; int symbol;
147          symbol = decode_symbol(cum_freq);       /* Decode next symbol.      */
148          if (symbol==EOF_symbol) break;          /* Exit loop if EOF symbol. */
149          ch = index_to_char[symbol];             /* Translate to a character.*/
150          putc(ch,stdout);                        /* Write that character.    */
151          update_model(symbol);                   /* Update the model.        */
152      }
153      exit(0);
154  }
arithmetic_decode.c

155  /* ARITHMETIC DECODING ALGORITHM. */
156
157  #include "arithmetic_coding.h"
158
159
160  /* CURRENT STATE OF THE DECODING. */
161
162  static code_value value;        /* Currently-seen code value               */
163  static code_value low, high;    /* Ends of current code region             */
164
165
166  /* START DECODING A STREAM OF SYMBOLS. */
167
168  start_decoding()
169  {   int i;
170      value = 0;                                  /* Input bits to fill the   */
171      for (i = 1; i<=Code_value_bits; i++) {      /* code value.              */
172          value = 2*value+input_bit();
173      }
174      low = 0;                                    /* Full code range.         */
175      high = Top_value;
176  }
177
178
179  /* DECODE THE NEXT SYMBOL. */
180
181  int decode_symbol(cum_freq)
182      int cum_freq[];                 /* Cumulative symbol frequencies           */
183  {   long range;                     /* Size of current code region             */
184      int cum;                        /* Cumulative frequency calculated         */
185      int symbol;                     /* Symbol decoded                          */
186      range = (long)(high-low)+1;
187      cum =                                       /* Find cum freq for value. */
188        (((long)(value-low)+1)*cum_freq[0]-1)/range;
189      for (symbol = 1; cum_freq[symbol]>cum; symbol++) ; /* Then find symbol. */
190      high = low +                                /* Narrow the code region   */
191        (range*cum_freq[symbol-1])/cum_freq[0]-1; /* to that allotted to this */
192      low = low +                                 /* symbol.                  */
193        (range*cum_freq[symbol])/cum_freq[0];
194      for (;;) {                                  /* Loop to get rid of bits. */
195          if (high<Half) {
196              /* nothing */                       /* Expand low half.         */
197          }
198          else if (low>=Half) {                   /* Expand high half.        */
199              value -= Half;
200              low -= Half;                        /* Subtract offset to top.  */
201              high -= Half;
202          }
203          else if (low>=First_qtr                 /* Expand middle half.      */
204                && high<Third_qtr) {
205              value -= First_qtr;
206              low -= First_qtr;                   /* Subtract offset to middle*/
207              high -= First_qtr;
208          }
209          else break;                             /* Otherwise exit loop.     */
210          low = 2*low;
211          high = 2*high+1;                        /* Scale up code range.     */
212          value = 2*value+input_bit();            /* Move in next input bit.  */
213      }
214      return symbol;
215  }
bit_input.c

216  /* BIT INPUT ROUTINES. */
217
218  #include <stdio.h>
219  #include "arithmetic_coding.h"
220
221
222  /* THE BIT BUFFER. */
223
224  static int buffer;              /* Bits waiting to be input                */
225  static int bits_to_go;          /* Number of bits still in buffer          */
226  static int garbage_bits;        /* Number of bits past end-of-file         */
227
228
229  /* INITIALIZE BIT INPUT. */
230
231  start_inputing_bits()
232  {   bits_to_go = 0;                             /* Buffer starts out with   */
233      garbage_bits = 0;                           /* no bits in it.           */
234  }
235
236
237  /* INPUT A BIT. */
238
239  int input_bit()
240  {   int t;
241      if (bits_to_go==0) {                        /* Read the next byte if no */
242          buffer = getc(stdin);                   /* bits are left in buffer. */
243          if (buffer==EOF) {
244              garbage_bits += 1;                      /* Return arbitrary bits*/
245              if (garbage_bits>Code_value_bits-2) {   /* after eof, but check */
246                  fprintf(stderr,"Bad input file\n"); /* for too many such.   */
247                  exit(-1);
248              }
249          }
250          bits_to_go = 8;
251      }
252      t = buffer&1;                               /* Return the next bit from */
253      buffer >>= 1;                               /* the bottom of the byte.  */
254      bits_to_go -= 1;
255      return t;
256  }
bit_output.c

257  /* BIT OUTPUT ROUTINES. */
258
259  #include <stdio.h>
260
261
262  /* THE BIT BUFFER. */
263
264  static int buffer;              /* Bits buffered for output                */
265  static int bits_to_go;          /* Number of bits free in buffer           */
266
267
268  /* INITIALIZE FOR BIT OUTPUT. */
269
270  start_outputing_bits()
271  {   buffer = 0;                                 /* Buffer is empty to start */
272      bits_to_go = 8;                             /* with.                    */
273  }
274
275
276  /* OUTPUT A BIT. */
277
278  output_bit(bit)
279      int bit;
280  {   buffer >>= 1;                               /* Put bit in top of buffer.*/
281      if (bit) buffer |= 0x80;
282      bits_to_go -= 1;
283      if (bits_to_go==0) {                        /* Output buffer if it is   */
284          putc(buffer,stdout);                    /* now full.                */
285          bits_to_go = 8;
286      }
287  }
288
289
290  /* FLUSH OUT THE LAST BITS. */
291
292  done_outputing_bits()
293  {   putc(buffer>>bits_to_go,stdout);
294  }



June 1987 Volume 30 Number 6 



Communications of the ACM 



529 



Computing Practices 



fixed_model.c

/* THE FIXED SOURCE MODEL */

#include "model.h"


/* INITIALIZE THE MODEL. */

start_model()
{   int i;
    for (i = 0; i<No_of_chars; i++) {           /* Set up tables that       */
        char_to_index[i] = i+1;                 /* translate between symbol */
        index_to_char[i+1] = i;                 /* indexes and characters.  */
    }
    cum_freq[No_of_symbols] = 0;
    for (i = No_of_symbols; i>0; i--) {         /* Set up cumulative        */
        cum_freq[i-1] = cum_freq[i] + freq[i];  /* frequency counts.        */
    }
    if (cum_freq[0] > Max_frequency) abort();   /* Check counts within limit*/
}


/* UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. */

update_model(symbol)
    int symbol;
{                                               /* Do nothing.              */
}

FIGURE 4. Fixed and Adaptive Models for Use with Figure 3
adaptive_model.c

/* THE ADAPTIVE SOURCE MODEL */

#include "model.h"

int freq[No_of_symbols+1];          /* Symbol frequencies                   */


/* INITIALIZE THE MODEL. */

start_model()
{   int i;
    for (i = 0; i<No_of_chars; i++) {           /* Set up tables that       */
        char_to_index[i] = i+1;                 /* translate between symbol */
        index_to_char[i+1] = i;                 /* indexes and characters.  */
    }
    for (i = 0; i<=No_of_symbols; i++) {        /* Set up initial frequency */
        freq[i] = 1;                            /* counts to be one for all */
        cum_freq[i] = No_of_symbols-i;          /* symbols.                 */
    }
    freq[0] = 0;                                /* Freq[0] must not be the  */
}                                               /* same as freq[1].         */


/* UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. */

update_model(symbol)
    int symbol;                     /* Index of new symbol                  */
{   int i;                          /* New index for symbol                 */
    if (cum_freq[0]==Max_frequency) {           /* See if frequency counts  */
        int cum;                                /* are at their maximum.    */
        cum = 0;
        for (i = No_of_symbols; i>=0; i--) {    /* If so, halve all the     */
            freq[i] = (freq[i]+1)/2;            /* counts (keeping them     */
            cum_freq[i] = cum;                  /* non-zero).               */
            cum += freq[i];
        }
    }
    for (i = symbol; freq[i]==freq[i-1]; i--) ; /* Find symbol's new index. */
    if (i<symbol) {
        int ch_i, ch_symbol;
        ch_i = index_to_char[i];                /* Update the translation   */
        ch_symbol = index_to_char[symbol];      /* tables if the symbol has */
        index_to_char[i] = ch_symbol;           /* moved.                   */
        index_to_char[symbol] = ch_i;
        char_to_index[ch_i] = symbol;
        char_to_index[ch_symbol] = i;
    }
    freq[i] += 1;                               /* Increment the frequency  */
    while (i>0) {                               /* count for the symbol and */
        i -= 1;                                 /* update the cumulative    */
        cum_freq[i] += 1;                       /* frequencies.             */
    }
}
Representing the Model

Implementations of models are discussed in the next 
section; here we are concerned only with the inter- 
face to the model (lines 20-38). In C, a byte is repre- 
sented as an integer between 0 and 255 (a char). 
Internally, we represent a byte as an integer be- 
tween 1 and 257 inclusive (an index), EOF being 
treated as a 257th symbol. It is advantageous to sort 
the model into frequency order, so as to minimize 
the number of executions of the decoding loop 
(line 189). To permit such reordering, the char/index 
translation is implemented as a pair of tables, 
index_to_char[] and char_to_index[]. In one of our
models, these tables simply form the index by adding 
1 to the char, but another implements a more com- 
plex translation that assigns small indexes to fre- 
quently used symbols. 

The probabilities in the model are represented as 
integer frequency counts, and cumulative counts are 
stored in the array cum_freq[]. As previously, this
array is "backwards," and the total frequency count,
which is used to normalize all frequencies, appears
in cum_freq[0]. Cumulative counts must not exceed
a predetermined maximum, Max_frequency, and the
model implementation must prevent overflow by
scaling appropriately. It must also ensure that neighboring
values in the cum_freq[] array differ by at
least 1; otherwise the affected symbol cannot be 
transmitted. 
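
The two constraints just stated, a bounded total and neighboring cumulative counts that differ by at least 1, can be checked while the "backwards" table is built. The helper below is a sketch written for this purpose; build_cum_freq and its parameters are assumed names, not part of the programs in Figures 3 and 4, although the loop follows the same pattern as the fixed model's initialization.

```c
/* Fill cum_freq[0..n] from the counts freq[1..n] using the "backwards"
   convention: cum_freq[n] = 0 and cum_freq[0] holds the total.
   Returns 1 if every symbol has a count of at least 1 and the total does
   not exceed max_frequency, and 0 otherwise. freq[0] is not used. */
int build_cum_freq(const int freq[], int n, int max_frequency, int cum_freq[])
{
    int i;
    cum_freq[n] = 0;
    for (i = n; i > 0; i--) {
        if (freq[i] < 1) return 0;      /* symbol i could not be transmitted */
        cum_freq[i-1] = cum_freq[i] + freq[i];
    }
    return cum_freq[0] <= max_frequency;
}
```

For three symbols with counts 5, 3, and 2, the table becomes 10, 5, 2, 0, so symbol 1 owns the top half of the range and cum_freq[0] = 10 is the normalizing factor.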



Incremental Transmission and Reception 

Unlike Figure 2, the program in Figure 3 represents
low and high as integers. A special data type,
code_value, is defined for these quantities, together
with some useful constants: Top_value, representing
the largest possible code_value, and First_qtr, Half,
and Third_qtr, representing parts of the range
(lines 6-16). Whereas in Figure 2 the current interval
is represented by [low, high), in Figure 3 it is
[low, high]; that is, the range now includes the value
of high. Actually, it is more accurate (though more
confusing) to say that, in the program in Figure 3,
the interval represented is [low, high + 0.11111...).
This is because when the bounds are scaled up to 
increase the precision, zeros are shifted into the low- 
order bits of low, but ones are shifted into high. Al- 
though it is possible to write the program to use a 
different convention, this one has some advantages 
in simplifying the code. 

As the code range narrows, the top bits of low and 
high become the same. Any bits that are the same 
can be transmitted immediately, since they cannot 
be affected by future narrowing. For encoding, since 
we know that low ≤ high, this requires code like

    for (;;) {
        if (high < Half) {
            output_bit(0);
            low = 2*low;
            high = 2*high+1;
        }
        else if (low >= Half) {
            output_bit(1);
            low = 2*(low-Half);
            high = 2*(high-Half)+1;
        }
        else break;
    }

which ensures that, upon completion, low < Half
≤ high. This can be found in lines 95-113 of
encode_symbol(), although there are some extra complications
caused by underflow possibilities (see the
next subsection). Care is taken to shift ones in at the
bottom when high is scaled, as noted above.
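
To see the simplified loop at work on concrete numbers, the following sketch collects the emitted bits in a character buffer instead of writing them out. The function shift_out_common_bits and the upper-case constants are assumptions of this illustration that mirror the 16-bit definitions of Figure 3; as in the simplified loop, the underflow precautions are deliberately left out.

```c
#include <stddef.h>

#define CODE_VALUE_BITS 16
#define TOP_VALUE (((long)1 << CODE_VALUE_BITS) - 1)
#define HALF      (2 * (TOP_VALUE / 4 + 1))

/* Shift out the leading bits that *low and *high have in common, appending
   them to out as '0'/'1' characters (capacity cap, including the NUL).
   Zeros are shifted into *low and ones into *high. Returns the bit count. */
size_t shift_out_common_bits(long *low, long *high, char *out, size_t cap)
{
    size_t n = 0;
    while (n + 1 < cap) {
        if (*high < HALF) {                 /* both in lower half: bit 0 */
            out[n++] = '0';
            *low  = 2 * *low;
            *high = 2 * *high + 1;
        } else if (*low >= HALF) {          /* both in upper half: bit 1 */
            out[n++] = '1';
            *low  = 2 * (*low - HALF);
            *high = 2 * (*high - HALF) + 1;
        } else {
            break;                          /* low < HALF <= high        */
        }
    }
    out[n] = '\0';
    return n;
}
```

Starting from low = 0x9000 and high = 0x9FFF, whose four leading bits 1001 agree, the sketch emits exactly those four bits and leaves the interval expanded to the full code range.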

Incremental reception is done using a number
called value as in Figure 2, in which processed
bits flow out the top (high-significance) end and
newly received ones flow in the bottom. Initially,
start_decoding() (lines 168-176) fills value with received
bits. Once decode_symbol() has identified the
next input symbol, it shifts out now-useless high-order
bits that are the same in low and high, shifting
value by the same amount (and replacing lost bits by
fresh input bits at the bottom end):

    for (;;) {
        if (high < Half) {
            value = 2*value+input_bit();
            low = 2*low;
            high = 2*high+1;
        }
        else if (low >= Half) {
            value = 2*(value-Half)+input_bit();
            low = 2*(low-Half);
            high = 2*(high-Half)+1;
        }
        else break;
    }

(see lines 194-213, again complicated by precautions 
against underflow, as discussed below). 



Proof of Decoding Correctness 

At this point it is worth checking that identification of the next
symbol by decode_symbol() works properly. Recall from Figure 2 that
decode_symbol() must use value to find the symbol that, when encoded,
reduces the range to one that still includes value. Lines 186-188 in
decode_symbol() identify the symbol for which

$$\mathit{cum\_freq}[\mathit{symbol}] \le \left\lfloor \frac{(\mathit{value} - \mathit{low} + 1) \times \mathit{cum\_freq}[0] - 1}{\mathit{high} - \mathit{low} + 1} \right\rfloor < \mathit{cum\_freq}[\mathit{symbol} - 1],$$

where $\lfloor\ \rfloor$ denotes the "integer part of" function that
comes from integer division with truncation. It is shown in the
Appendix that this implies

$$\mathit{low} + \left\lfloor \frac{(\mathit{high} - \mathit{low} + 1) \times \mathit{cum\_freq}[\mathit{symbol}]}{\mathit{cum\_freq}[0]} \right\rfloor \le \mathit{value} \le \mathit{low} + \left\lfloor \frac{(\mathit{high} - \mathit{low} + 1) \times \mathit{cum\_freq}[\mathit{symbol} - 1]}{\mathit{cum\_freq}[0]} \right\rfloor - 1,$$

so that value lies within the new interval that decode_symbol()
calculates in lines 190-193. This is sufficient to guarantee that the
decoding operation identifies each symbol correctly.
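The inequality can also be checked by brute force over a small code range. The sketch below is an illustration only: the function name decoding_inequality_holds(), the tiny range, and the three-symbol model used to exercise it are assumptions, not part of Figure 3.

```c
#include <assert.h>

/* For every interval [low, high] within [0, top] that is at least as
   large as the total count, and every value inside it, select the
   symbol as the decoder does and confirm that the narrowed interval
   still contains value. Returns 1 if the check never fails. */
int decoding_inequality_holds(const int *cum_freq, int nsym, long top)
{
    long low, high, value;
    assert(nsym >= 1 && cum_freq[nsym] == 0);
    for (low = 0; low <= top; low++) {
        for (high = low + cum_freq[0] - 1; high <= top; high++) {
            long range = high - low + 1;
            for (value = low; value <= high; value++) {
                long cum = ((value - low + 1) * cum_freq[0] - 1) / range;
                long new_low, new_high;
                int s = 1;
                while (cum_freq[s] > cum) s++;     /* find the symbol */
                new_low  = low + range * cum_freq[s] / cum_freq[0];
                new_high = low + range * cum_freq[s - 1] / cum_freq[0] - 1;
                if (value < new_low || value > new_high) return 0;
            }
        }
    }
    return 1;
}
```

Calling it with the hypothetical table {10, 2, 1, 0} (counts 8, 1, 1) and top = 63 visits every interval of a 6-bit code range.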

Underflow 

As Figure 1 shows, arithmetic coding works by scaling the cumulative
probabilities given by the model into the interval [low, high] for each
character transmitted. Suppose low and high are very close together, so
close that this scaling operation maps some different symbols of the
model onto the same integer in the [low, high] interval. This would be
disastrous, because if such a symbol actually occurred it would not be
possible to continue encoding. Consequently, the encoder must guarantee
that the interval [low, high] is always large enough to prevent this.
The simplest way to do this is to ensure that this interval is at least
as large as Max_frequency, the maximum allowed cumulative frequency
count (line 36).

How could this condition be violated? The bit-shifting operation
explained above ensures that low and high can only become close
together when they straddle Half. Suppose in fact they become as close
as

    First_qtr ≤ low < Half ≤ high < Third_qtr.

Then the next two bits sent will have opposite polarity, either 01 or
10. For example, if the next bit turns out to be zero (i.e., high
descends below Half and [0, Half] is expanded to the full interval),
the bit after that will be one, since the range has to be above the
midpoint of the expanded interval. Conversely, if the next bit happens
to be one, the one after that will be zero. Therefore the interval can
safely be expanded right now, if only we remember that, whatever bit
actually comes next, its opposite must be transmitted afterwards as
well. Thus lines 104-109 expand [First_qtr, Third_qtr] into the whole
interval, remembering in bits_to_follow that the bit that is output
next must be followed by an opposite bit. This explains why all output
is done via bit_plus_follow() (lines 128-135), instead of directly with
output_bit().
But what if, after this operation, it is still true that

    First_qtr ≤ low < Half ≤ high < Third_qtr?

Figure 5 illustrates this situation, where the current [low, high]
range (shown as a thick line) has been expanded a total of three times.
Suppose the next bit turns out to be zero, as indicated by the arrow in
Figure 5a being below the halfway point. Then the next three bits will
be ones, since the arrow is not only in the top half of the bottom half
of the original range, but in the top quarter, and moreover the top
eighth, of that half; this is why the expansion can occur three times.
Similarly, as Figure 5b shows, if the next bit turns out to be a one,
it will be followed by three zeros. Consequently, we need only count
the number of expansions and follow the next bit by that number of
opposites (lines 106 and 131-134).
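The counting scheme can be sketched in a few lines. This is an illustration under stated assumptions rather than the code of Figure 3: bits are collected in a hypothetical in-memory string out_bits instead of being written to a file, and figure5_demo() is an invented helper that reproduces the three-expansion situation of Figure 5.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit sink for the illustration: '0'/'1' characters. */
char out_bits[64];
int  out_len = 0;
long bits_to_follow = 0;      /* expansions since the last output bit */

void output_bit(int bit)
{
    assert(out_len < (int)sizeof out_bits - 1);
    out_bits[out_len++] = bit ? '1' : '0';
    out_bits[out_len] = '\0';
}

/* Output a bit, then one opposite bit for each pending expansion. */
void bit_plus_follow(int bit)
{
    output_bit(bit);
    while (bits_to_follow > 0) {
        output_bit(!bit);
        bits_to_follow -= 1;
    }
}

/* Three expansions of the middle half, then the next bit is decided. */
const char *figure5_demo(int next_bit)
{
    out_len = 0;
    out_bits[0] = '\0';
    bits_to_follow = 3;
    bit_plus_follow(next_bit);
    return out_bits;
}
```

With three pending expansions, a zero is emitted as 0111 and a one as 1000, matching the two cases of Figure 5.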

Using this technique the encoder can guarantee that, after the shifting
operations, either

    low < First_qtr < Half ≤ high          (1a)

or

    low < Half < Third_qtr ≤ high.         (1b)

Therefore, as long as the integer range spanned by the cumulative
frequencies fits into a quarter of that provided by code_values, the
underflow problem cannot occur. This corresponds to the condition

$$\mathit{Max\_frequency} \le \frac{\mathit{Top\_value} + 1}{4} + 1,$$

which is satisfied by Figure 3, since Max_frequency = $2^{14} - 1$ and
Top_value = $2^{16} - 1$ (lines 36, 9). More than 14 bits cannot be
used to represent cumulative frequency counts without increasing the
number of bits allocated to code_values.

We have discussed underflow in the encoder only. 
Since the decoder's job, once each symbol has been 
decoded, is to track the operation of the encoder, 
underflow will be avoided if it performs the same 
expansion operation under the same conditions. 
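The way the decoder tracks the encoder can be seen in a compact round trip. The sketch below is not the program of Figure 3. It assumes a hypothetical three-symbol model with counts 8, 1, and 1, keeps the coded bits in an array, and tells the decoder the message length instead of sending an end-of-file symbol. The helper names put(), get(), narrow(), and round_trip() are invented for the illustration.

```c
#include <assert.h>

#define Code_value_bits 16
#define Top_value (((long)1 << Code_value_bits) - 1)
#define First_qtr (Top_value / 4 + 1)
#define Half      (2 * First_qtr)
#define Third_qtr (3 * First_qtr)

/* Hypothetical model: symbols 1..3 with counts 8, 1, 1 (total 10). */
static const long cum_freq[4] = { 10, 2, 1, 0 };

static char bits[512];                 /* coded bits, one per element */
static int  nbits, rpos;
static long low, high, value, bits_to_follow;

static void put(int b) { assert(nbits < 512); bits[nbits++] = (char)b; }
static int  get(void)  { return rpos < nbits ? bits[rpos++] : 0; }

static void bit_plus_follow(int b)
{
    put(b);
    for (; bits_to_follow > 0; bits_to_follow--) put(!b);
}

/* Narrow [low, high] to the part allotted to symbol s. */
static void narrow(int s)
{
    long range = high - low + 1;
    high = low + range * cum_freq[s - 1] / cum_freq[0] - 1;
    low  = low + range * cum_freq[s] / cum_freq[0];
}

static void encode_symbol(int s)
{
    narrow(s);
    for (;;) {
        if (high < Half) {
            bit_plus_follow(0);
        } else if (low >= Half) {
            bit_plus_follow(1);
            low -= Half; high -= Half;
        } else if (low >= First_qtr && high < Third_qtr) {
            bits_to_follow++;              /* an opposite bit is owed */
            low -= First_qtr; high -= First_qtr;
        } else break;
        low = 2 * low; high = 2 * high + 1;
    }
}

static int decode_symbol(void)
{
    long cum = ((value - low + 1) * cum_freq[0] - 1) / (high - low + 1);
    int s = 1;
    while (cum_freq[s] > cum) s++;
    narrow(s);
    for (;;) {                             /* same tests as the encoder */
        if (high < Half) {
            /* nothing to subtract */
        } else if (low >= Half) {
            value -= Half; low -= Half; high -= Half;
        } else if (low >= First_qtr && high < Third_qtr) {
            value -= First_qtr; low -= First_qtr; high -= First_qtr;
        } else break;
        low = 2 * low; high = 2 * high + 1;
        value = 2 * value + get();
    }
    return s;
}

/* Encode msg[0..n-1], decode it again; return 1 if all symbols match. */
int round_trip(const int *msg, int n)
{
    int i;
    nbits = rpos = 0; low = 0; high = Top_value; bits_to_follow = 0;
    for (i = 0; i < n; i++) encode_symbol(msg[i]);
    bits_to_follow++;                      /* final 01 or 10 */
    bit_plus_follow(low < First_qtr ? 0 : 1);

    low = 0; high = Top_value; value = 0;
    for (i = 0; i < Code_value_bits; i++) value = 2 * value + get();
    for (i = 0; i < n; i++)
        if (decode_symbol() != msg[i]) return 0;
    return 1;
}
```

Because the decoder applies the same three tests in the same order, its low and high stay identical to the encoder's, and value is shifted and reduced in step with them.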

FIGURE 5. Scaling the Interval to Prevent Underflow [panels (a) and (b); vertical scale marked low, First_qtr, Half, high, Third_qtr, Top_value]

Overflow

Now consider the possibility of overflow in the integer multiplications
corresponding to those of Figure 2, which occur in lines 91-94 and
190-193 of Figure 3. Overflow cannot occur provided the product

    range * Max_frequency

fits within the integer word length available, since cumulative
frequencies cannot exceed Max_frequency. Range might be as large as
Top_value + 1, so the largest possible product in Figure 3 is
$2^{16}(2^{14} - 1)$, which is less than $2^{30}$. Long declarations
are used for code_value (line 7) and range (lines 89, 183) to ensure
that arithmetic is done to 32-bit precision.



Constraints on the Implementation 

The constraints on word length imposed by underflow and overflow can be
simplified by assuming that frequency counts are represented in f bits,
and code_values in c bits. The implementation will work correctly
provided

    f ≤ c - 2
    f + c ≤ p,

where p is the precision to which arithmetic is performed.

In most C implementations, p = 31 if long integers are used, and p = 32
if they are unsigned long. In Figure 3, f = 14 and c = 16. With
appropriately modified declarations, unsigned long arithmetic with
f = 15 and c = 17 could be used. In assembly language, c = 16 is a
natural choice because it expedites some comparisons and bit
manipulations (e.g., those of lines 95-113 and 194-213).

If p is restricted to 16 bits, the best values possible are c = 9 and
f = 7, making it impossible to encode a full alphabet of 256 symbols,
as each symbol must have a count of at least one. A smaller alphabet
(e.g., the 26 letters, or 4-bit nibbles) could still be handled.
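The two constraints are easy to check mechanically. The following sketch uses hypothetical helper names, word_lengths_ok() and best_frequency_bits(); it simply restates the inequalities above and picks the largest f they allow for a given p.

```c
#include <assert.h>

/* Check the two word-length constraints:
   f <= c - 2 (underflow) and f + c <= p (overflow). */
int word_lengths_ok(int f, int c, int p)
{
    return f <= c - 2 && f + c <= p;
}

/* Largest f satisfying both constraints for precision p; the matching
   number of code-value bits is written through c_out. */
int best_frequency_bits(int p, int *c_out)
{
    int f;
    assert(p >= 2 && c_out != 0);
    f = (p - 2) / 2;          /* from f + (f + 2) <= p */
    *c_out = p - f;           /* remaining bits go to code values */
    return f;
}
```

For p = 16 this gives f = 7 and c = 9, and for p = 32 it gives f = 15 and c = 17, in agreement with the figures quoted above.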

Termination 

To finish the transmission, it is necessary to send a unique
terminating symbol (EOF_symbol, line 56) and then follow it by enough
bits to ensure that the encoded string falls within the final range.
Since done_encoding() (lines 119-123) can be sure that low and high are
constrained by either Eq. (1a) or (1b) above, it need only transmit 01
in the first case or 10 in the second to remove the remaining
ambiguity. It is convenient to do this using the bit_plus_follow()
procedure discussed earlier. The input_bit() procedure will actually
read a few more bits than were sent by output_bit(), as it needs to
keep the low end of the buffer full. It does not matter what value
these bits have, since EOF is uniquely determined by the last two bits
actually transmitted.

MODELS FOR ARITHMETIC CODING 

The program in Figure 3 must be used with a model that provides a pair
of translation tables index_to_char[] and char_to_index[], and a
cumulative frequency array cum_freq[]. The requirements on the latter
are that

• cum_freq[i - 1] ≥ cum_freq[i];

• an attempt is never made to encode a symbol i for which
  cum_freq[i - 1] = cum_freq[i]; and

• cum_freq[0] ≤ Max_frequency.

Provided these conditions are satisfied, the values in the array need
bear no relationship to the actual cumulative symbol frequencies in
messages. Encoding and decoding will still work correctly, although
encodings will occupy less space if the frequencies are accurate.
(Recall our successfully encoding eaii! according to the model of
Table I, which does not actually reflect the frequencies in the
message.)
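These requirements can be expressed as a small validity check. The sketch is illustrative: model_is_valid() and symbol_is_encodable() are hypothetical names, Max_frequency is set to the value quoted for Figure 3, and the test that the last entry is zero is an assumption of the sketch rather than one of the listed requirements.

```c
#include <assert.h>

#define Max_frequency 16383   /* 2^14 - 1, the value quoted for Figure 3 */

/* Return 1 if cum_freq[0..nsym] is non-increasing and its total,
   cum_freq[0], does not exceed Max_frequency. The requirement that
   cum_freq[nsym] be zero is an assumption of this sketch. */
int model_is_valid(const int *cum_freq, int nsym)
{
    int i;
    assert(nsym >= 1);
    if (cum_freq[0] > Max_frequency) return 0;
    for (i = 1; i <= nsym; i++)
        if (cum_freq[i - 1] < cum_freq[i]) return 0;
    return cum_freq[nsym] == 0;
}

/* Return 1 if symbol i (1..nsym) has a nonzero count, so that an
   attempt to encode it is permitted. */
int symbol_is_encodable(const int *cum_freq, int i)
{
    assert(i >= 1);
    return cum_freq[i - 1] > cum_freq[i];
}
```

A table such as {10, 2, 2, 0} is valid, but its second symbol has a count of zero and must never be presented to the encoder.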

Fixed Models 

The simplest kind of model is one in which symbol frequencies are
fixed. The first model in Figure 4 has symbol frequencies that
approximate those of English (taken from a part of the Brown Corpus
[12]). However, bytes that did not occur in that sample have been given
frequency counts of one in case they do occur in messages to be encoded
(so this model will still work for binary files in which all 256 bytes
occur). Frequencies have been normalized to total 8000. The
initialization procedure start_model() simply computes a cumulative
version of these frequencies (lines 48-51), having first initialized
the translation tables (lines 44-47). Execution speed would be improved
if these tables were used to reorder symbols and frequencies so that
the most frequent came first in the cum_freq[] array. Since the model
is fixed, the procedure update_model(), which is called from both
encode.c and decode.c, is null.
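The cumulative computation can be sketched for a tiny alphabet. The four counts below are hypothetical and chosen only to keep the arithmetic visible; the translation tables are omitted, so this is not the start_model() of Figure 4.

```c
#include <assert.h>

#define No_of_symbols 4   /* hypothetical tiny alphabet */

int freq[No_of_symbols + 1] = { 0, 5, 2, 2, 1 };  /* freq[0] is unused */
int cum_freq[No_of_symbols + 1];

/* Build the cumulative table from the fixed counts: cum_freq[i] is the
   total count of all symbols with index greater than i, so cum_freq[0]
   is the grand total and cum_freq[No_of_symbols] is zero. */
void start_model(void)
{
    int i;
    cum_freq[No_of_symbols] = 0;
    for (i = No_of_symbols; i > 0; i--) {
        assert(freq[i] >= 1);          /* every symbol stays encodable */
        cum_freq[i - 1] = cum_freq[i] + freq[i];
    }
}
```

After the call, cum_freq holds 10, 5, 3, 1, 0, so symbol 1 owns the top half of any interval and symbol 4 the bottom tenth.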

An exact model is one where the symbol frequen- 
cies in the message are exactly as prescribed by the 
model. For example, the fixed model of Figure 4 is 
close to an exact model for the particular excerpt of 
the Brown Corpus from which it was taken. To be 
truly exact, however, symbols that did not occur in 
the excerpt would be assigned counts of zero, rather 
than one (sacrificing the capability of transmitting 
messages containing those symbols). Moreover, the 
frequency counts would not be scaled to a predeter- 
mined cumulative frequency, as they have been in 
Figure 4. The exact model can be calculated and 
transmitted before the message is sent. It is shown 
by Cleary and Witten [3] that, under quite general 
conditions, this will not give better overall compres- 
sion than adaptive coding (which is described next). 

Adaptive Models 

An adaptive model represents the changing symbol 
frequencies seen so far in a message. Initially all 
counts might be the same (reflecting no initial infor- 
mation), but they are updated, as each symbol is 
seen, to approximate the observed frequencies. Pro- 
vided both encoder and decoder use the same initial 
values (e.g., equal counts) and the same updating 
algorithm, their models will remain in step. The en- 
coder receives the next symbol, encodes it, and up- 
dates its model. The decoder identifies it according 
to its current model and then updates its model. 

The second half of Figure 4 shows such an adaptive model. This is the
type of model recommended for use with Figure 3, for in practice it
will outperform a fixed model in terms of compression efficiency.
Initialization is the same as for the fixed model, except that all
frequencies are set to one. The procedure update_model(symbol) is
called by both encode_symbol() and decode_symbol() (Figure 3, lines 54
and 151) after each symbol is processed.

Updating the model is quite expensive because of the need to maintain
cumulative totals. In the code of Figure 4, frequency counts, which
must be maintained anyway, are used to optimize access by keeping the
array in frequency order, an effective kind of self-organizing linear
search [9]. Update_model() first checks to see if the new model will
exceed the cumulative-frequency limit, and if so scales all frequencies
down by a factor of two (taking care to ensure that no count scales to
zero) and recomputes cumulative values (Figure 4, lines 29-37). Then,
if necessary, update_model() reorders the symbols to place the current
one in its correct rank in the frequency ordering, altering the
translation tables to reflect the change. Finally, it increments the
appropriate frequency count and adjusts cumulative frequencies
accordingly.
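The three steps just described (halve, reorder, increment) can be sketched as follows. This follows the description above rather than reproducing Figure 4: the three-letter alphabet and the very small Max_frequency are assumptions made so that the halving step is easy to trigger.

```c
#include <assert.h>

#define No_of_symbols 3   /* hypothetical tiny alphabet: 'a', 'b', 'c' */
#define Max_frequency 8   /* deliberately small for the illustration */

int freq[No_of_symbols + 1];       /* freq[0] stays 0 and acts as a sentinel */
int cum_freq[No_of_symbols + 1];   /* cum_freq[i] = sum of freq[i+1..] */
int index_to_char[No_of_symbols + 1];
int char_to_index[128];

void start_model(void)             /* all counts start at one */
{
    int i;
    for (i = 1; i <= No_of_symbols; i++) {
        index_to_char[i] = 'a' + i - 1;
        char_to_index['a' + i - 1] = i;
    }
    for (i = 0; i <= No_of_symbols; i++) {
        freq[i] = (i == 0) ? 0 : 1;
        cum_freq[i] = No_of_symbols - i;
    }
}

void update_model(int symbol)
{
    int i;
    assert(symbol >= 1 && symbol <= No_of_symbols);
    if (cum_freq[0] == Max_frequency) {      /* halve, rounding up */
        int cum = 0;
        for (i = No_of_symbols; i >= 0; i--) {
            freq[i] = (freq[i] + 1) / 2;
            cum_freq[i] = cum;
            cum += freq[i];
        }
    }
    for (i = symbol; freq[i] == freq[i - 1]; i--)
        ;                                    /* find the new rank */
    if (i < symbol) {                        /* swap table entries */
        int ch_i = index_to_char[i];
        int ch_symbol = index_to_char[symbol];
        index_to_char[i] = ch_symbol;
        index_to_char[symbol] = ch_i;
        char_to_index[ch_i] = symbol;
        char_to_index[ch_symbol] = i;
    }
    freq[i] += 1;                            /* count the symbol */
    while (i > 0) {                          /* fix cumulative totals */
        i -= 1;
        cum_freq[i] += 1;
    }
}
```

Rounding up in the halving step, (freq[i] + 1) / 2, is what keeps a count of one from scaling to zero.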

PERFORMANCE 

Now consider the performance of the algorithm of 
Figure 3, both in compression efficiency and execu- 
tion time. 

Compression Efficiency 

In principle, when a message is coded using arith- 
metic coding, the number of bits in the encoded 
string is the same as the entropy of that message 
with respect to the model used for coding. Three 
factors cause performance to be worse than this in 
practice: 

(1) message termination overhead; 

(2) the use of fixed-length rather than infinite- 
precision arithmetic; and 

(3) scaling of counts so that their total is at most
Max_frequency.

None of these effects is significant, as we now show. 
In order to isolate the effect of arithmetic coding, the 
model will be considered to be exact (as defined 
above). 

Arithmetic coding must send extra bits at the end of each message,
causing a message termination overhead. Two bits are needed, sent by
done_encoding() (Figure 3, lines 119-123), in order to disambiguate the
final symbol. In cases where a bit stream must be blocked into 8-bit
characters before encoding, it will be necessary to round out to the
end of a block. Combining these, an extra 9 bits may be required.

The overhead of using fixed-length arithmetic occurs because remainders
are truncated on division. It can be assessed by comparing the
algorithm's performance with the figure obtained from a theoretical
entropy calculation that derives its frequencies from counts scaled
exactly as for coding. It is completely negligible, on the order of
$10^{-4}$ bits/symbol.

The penalty paid by scaling counts is somewhat larger, but still very
small. For short messages (less than $2^{14}$ bytes), no scaling need
be done. Even with messages of $10^5$-$10^6$ bytes, the overhead was
found experimentally to be less than 0.25 percent of the encoded
string.

The adaptive model in Figure 4 scales down all counts whenever the
total threatens to exceed Max_frequency. This has the effect of
weighting recent events more heavily than events from earlier in the
message. The statistics thus tend to track changes in the input
sequence, which can be very beneficial. (We have encountered cases
where limiting counts to 6 or 7 bits gives better results than working
to higher precision.) Of course, this depends on the source being
modeled. Bentley et al. [2] consider other, more explicit, ways of
incorporating a recency effect.

Execution Time 

The program in Figure 3 has been written for clarity rather than for
execution speed. In fact, with the adaptive model in Figure 4, it takes
about 420 μs per input byte on a VAX-11/780 to encode a text file, and
about the same for decoding. However, easily avoidable overheads such
as procedure calls account for much of this, and some simple
optimizations increase speed by a factor of two. The following
alterations were made to the C version shown:

(1) The procedures input_bit(), output_bit(), and bit_plus_follow()
    were converted to macros to eliminate procedure-call overhead.

(2) Frequently used quantities were put in register variables.

(3) Multiplies by two were replaced by additions (C "+=").

(4) Array indexing was replaced by pointer manipulation in the loops
    at line 189 in Figure 3 and lines 49-52 of the adaptive model in
    Figure 4.

This mildly optimized C implementation has an execution time of
214 μs/262 μs per input byte, for encoding/decoding 100,000 bytes of
English text on a VAX-11/780, as shown in Table II. Also given are
corresponding figures for the same program on an Apple Macintosh and a
SUN-3/75. As can be seen, coding a C source program of the same length
took slightly longer in all cases, and a binary object program longer
still. The reason for this will be discussed shortly. Two artificial
test files were included to allow readers to replicate the results.
"Alphabet" consists of enough copies of the 26-letter alphabet to fill
out 100,000 characters (ending with a partially completed alphabet).
"Skew statistics" contains 10,000 copies of the string aaaabaaaac; it
demonstrates that files may be encoded into less than one bit per
character (output size of 12,092 bytes = 96,736 bits). All results
quoted used the adaptive model of Figure 4.

TABLE II. Results for Encoding and Decoding 100,000-Byte Files

                                       VAX-11/780    Macintosh 512K          SUN-3/75
                         Output   Encode   Decode   Encode   Decode   Encode   Decode
                        (bytes)     (μs)     (μs)     (μs)     (μs)     (μs)     (μs)

Mildly optimized C implementation
  Text file              57,718      214      262      687      881       98      121
  C program              62,991      230      288      729      950      105      131
  VAX object program     73,501      313      406      950    1,334      145      190
  Alphabet               59,292      223      277      719      942      105      130
  Skew statistics        12,092      143      170      507      645       70       85

Carefully optimized assembly-language implementation
  Text file              57,718      104      135      194      243       46       58
  C program              62,991      109      151      208      266       51       65
  VAX object program     73,501      158      241      280      402       75      107
  Alphabet               59,292      105      145      204      264       51       65
  Skew statistics        12,092       63       81      126      160       28       36

Notes: Times are measured in microseconds per byte of uncompressed data.
The VAX-11/780 had a floating-point accelerator, which reduces integer multiply and divide times.
The Macintosh uses an 8-MHz MC68000 with some memory wait states.
The SUN-3/75 uses a 16.67-MHz MC68020.
All times exclude I/O and operating-system overhead in support of I/O. VAX and SUN figures give user time from the UNIX time command; on the Macintosh, I/O was explicitly directed to an array.
The 4.2BSD C compiler was used for VAX and SUN; Aztec C 1.06g for Macintosh.

A further factor of two can be gained by reprogramming in assembly
language. A carefully optimized version of Figures 3 and 4 (adaptive
model) was written in both VAX and M68000 assembly languages. Full use
was made of registers, and advantage taken of the 16-bit code_value to
expedite some crucial comparisons and make subtractions of Half
trivial. The performance of these implementations on the test files is
also shown in Table II in order to give the reader some idea of typical
execution speeds.

The VAX-11/780 assembly-language timings are broken down in Table III.
These figures were obtained with the UNIX profile facility and are
accurate only to within perhaps 10 percent. (This mechanism constructs
a histogram of program counter values at real-time clock interrupts and
suffers from statistical variation as well as some systematic errors.)
"Bounds calculation" refers to the initial parts of encode_symbol() and
decode_symbol() (Figure 3, lines 90-94 and 190-193), which contain
multiply and divide operations. "Bit shifting" is the major loop in
both the encode and decode routines (lines 95-113 and 194-213). The cum
calculation in decode_symbol(), which requires a multiply/divide, and
the following loop to identify the next symbol (lines 187-189), is
"Symbol decode." Finally, "Model update" refers to the adaptive
update_model() procedure of Figure 4 (lines 26-53).

TABLE III. Breakdown of Timings for the VAX-11/780 Assembly-Language Version

                          Encode time  Decode time
                                 (μs)         (μs)

Text file
  Bounds calculation               32           31
  Bit shifting                     39           30
  Model update                     29           29
  Symbol decode                     -           45
  Other                             4            0
  Total                           104          135

C program
  Bounds calculation               30           28
  Bit shifting                     42           35
  Model update                     33           36
  Symbol decode                     -           51
  Other                             4            1
  Total                           109          151

VAX object program
  Bounds calculation               34           31
  Bit shifting                     46           40
  Model update                     75           75
  Symbol decode                     -           94
  Other                             3            1
  Total                           158          241

As expected, the bounds calculation and model update take the same time
for both encoding and decoding, within experimental error. Bit shifting
was quicker for the text file than for the C program and object file
because compression performance was better. The extra time for decoding
over encoding is due entirely to the symbol decode step. This takes
longer in the C program and object file tests because the loop of
line 189 was executed more often (on average 9 times, 13 times, and
35 times, respectively). This also affects the model update time
because it is the number of cumulative counts that must be incremented
in Figure 4, lines 49-52. In the worst case, when the symbol
frequencies are uniformly distributed, these loops are executed an
average of 128 times. Worst-case performance would be improved by using
a more complex tree representation for frequencies, but this would
likely be slower for text files.

SOME APPLICATIONS 

Applications of arithmetic coding are legion. By lib- 
erating coding with respect to a model from the mod- 
eling required for prediction, it encourages a whole 
new view of data compression [19]. This separation 
of function costs nothing in compression perfor- 
mance, since arithmetic coding is (practically) opti- 
mal with respect to the entropy of the model. Here 
we intend to do no more than suggest the scope of 
this view by briefly considering 

(1) adaptive text compression, 

(2) nonadaptive coding, 

(3) compressing black/white images, and 

(4) coding arbitrarily distributed integers. 

Of course, as noted earlier, greater coding efficien- 
cies could easily be achieved with more sophisti- 
cated models. Modeling, however, is an extensive 
topic in its own right and is beyond the scope of this 
article. 

Adaptive text compression using single-character adaptive frequencies
shows off arithmetic coding to good effect. The results obtained using
the program in Figures 3 and 4 vary from 4.8-5.3 bits/character for
short English text files ($10^3$-$10^4$ bytes) to 4.5-4.7
bits/character for long ones ($10^5$-$10^6$ bytes). Although adaptive
Huffman techniques do exist (e.g., [5, 7]), they lack the conceptual
simplicity of arithmetic coding. Although competitive in compression
efficiency for many files, they are slower. For example, Table IV
compares the performance of the mildly optimized C implementation of
arithmetic coding with that of the UNIX compact program that implements
adaptive Huffman coding using a similar model. (Compact's model is
essentially the same for long files, like those of Table IV, but is
better for short files than the model used as an example in this
article.) Casual examination of compact indicates that the care taken
in optimization is roughly comparable for both systems, yet arithmetic
coding halves execution time. Compression performance is somewhat
better with arithmetic coding on all the example files. The difference
would be accentuated with more sophisticated models that predict
symbols with probabilities approaching one under certain circumstances
(e.g., the letter u following q).

Nonadaptive coding can be performed arithmetically using fixed,
prespecified models like that in the first part of Figure 4.
Compression performance will be better than Huffman coding. In order to
minimize execution time, the total frequency count, cum_freq[0], should
be chosen as a power of two so the divisions in the bounds calculations
(Figure 3, lines 91-94 and 190-193) can be done as shifts.
Encode/decode times of around 60 μs/90 μs should then be possible for
an assembly-language implementation on a VAX-11/780. A carefully
written implementation of Huffman coding, using table lookup for
encoding and decoding, would be a bit faster in this application.
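Replacing the division by a shift might look like the sketch below. It assumes a hypothetical fixed model whose counts total exactly 2^13 = 8192 (the model in Figure 4 is normalized to 8000, so this total is an assumption), and narrow_with_shift() is an invented name.

```c
#include <assert.h>

#define Total_bits 13   /* hypothetical: cum_freq[0] fixed at 2^13 = 8192 */

/* Narrow [*low, *high] to the subinterval of a symbol whose cumulative
   counts are cum_lo (cum_freq[symbol]) and cum_hi (cum_freq[symbol-1]),
   using right shifts in place of division by the total count. */
void narrow_with_shift(long *low, long *high, long cum_lo, long cum_hi)
{
    long range = *high - *low + 1;
    assert(range > 0);
    assert(0 <= cum_lo && cum_lo < cum_hi && cum_hi <= (1L << Total_bits));
    *high = *low + ((range * cum_hi) >> Total_bits) - 1;
    *low  = *low + ((range * cum_lo) >> Total_bits);
}
```

Because all operands are nonnegative, the shift gives the same truncated quotient as dividing by 8192, so encoder and decoder remain in step as long as both use it.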

TABLE IV. Comparison of Arithmetic and Adaptive Huffman Coding

                                Arithmetic coding     Adaptive Huffman coding
                         Output   Encode   Decode   Output   Encode   Decode
                        (bytes)     (μs)     (μs)  (bytes)     (μs)     (μs)

  Text file              57,718      214      262   57,781      550      414
  C program              62,991      230      288   63,731      596      441
  VAX object program     73,546      313      406   76,950      822      606
  Alphabet               59,292      223      277   60,127      598      411
  Skew statistics        12,092      143      170   16,257      215      132

Notes: The mildly optimized C implementation was used for arithmetic coding.
UNIX compact was used for adaptive Huffman coding.
Times are for a VAX-11/780 and exclude I/O and operating-system overhead in support of I/O.

Compressing black/white images using arithmetic coding has been
investigated by Langdon and Rissanen [14], who achieved excellent
results using a model that conditioned the probability of a pixel's
being black on a template of pixels surrounding it. The template
contained a total of 10 pixels, selected from those above and to the
left of the current one so that they precede it in the raster scan.
This creates 1024 different possible contexts, and for each the
probability of the pixel being black was estimated adaptively as the
picture was transmitted. Each pixel's polarity was then coded
arithmetically according to this probability. A 20-30 percent
improvement in compression was attained over earlier methods. To
increase coding speed, Langdon and Rissanen used an approximate method
of arithmetic coding that avoided multiplication by representing
probabilities as integer powers of 1/2. Huffman coding cannot be
directly used in this application, as it never compresses with a
two-symbol alphabet.

Run-length coding, a popular method for use with two-valued alphabets,
provides another opportunity for arithmetic coding. The model reduces
the data to a sequence of lengths of runs of the same symbol (e.g., for
picture coding, run-lengths of black followed by white followed by
black followed by white ...). The sequence of lengths must be
transmitted. The CCITT facsimile coding standard [11] bases a Huffman
code on the frequencies with which black and white runs of different
lengths occur in sample documents. A fixed arithmetic code using these
same frequencies would give better performance; adapting the
frequencies to each particular document would be better still.
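The reduction of a scan line to run lengths can be sketched as follows. The function name run_lengths() and the convention that the first run always counts white pixels (and may therefore be zero) are assumptions of this illustration; the resulting integers are what a fixed or adaptive arithmetic code would then transmit.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a row of two-valued pixels (0 = white, 1 = black) into run
   lengths. By the convention assumed here, the first run counts white
   pixels and may be zero, so the colors of later runs alternate.
   Returns the number of runs written to runs[]. */
size_t run_lengths(const unsigned char *pixels, size_t n,
                   int *runs, size_t max_runs)
{
    size_t i, count = 0;
    int current = 0;          /* color of the run being measured */
    int length = 0;
    for (i = 0; i < n; i++) {
        if (pixels[i] == current) {
            length++;
        } else {
            assert(count < max_runs);
            runs[count++] = length;   /* close the finished run */
            current = !current;
            length = 1;
        }
    }
    assert(count < max_runs);
    runs[count++] = length;           /* close the final run */
    return count;
}
```

For the row 1 1 0 0 0 1 the function produces the runs 0, 2, 3, 1: no leading white pixels, two black, three white, one black.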

Coding arbitrarily distributed integers is often called for in use with
more sophisticated models of text, image, or other data. Consider, for
instance, the locally adaptive data compression scheme of Bentley et
al. [2], in which the encoder and decoder cache the last N different
words seen. A word present in the cache is transmitted by sending the
integer cache index. Words not in the cache are transmitted by sending
a new-word marker followed by the characters of the word. This is an
excellent model for text in which words are used frequently over short
intervals and then fall into long periods of disuse. Their paper
discusses several variable-length codings for the integers used as
cache indexes. Arithmetic coding allows any probability distribution to
be used as the basis for a variable-length encoding, including, among
countless others, the ones implied by the particular codes discussed
there. It also permits use of an adaptive model for cache indexes,
which is desirable if the distribution of cache hits is difficult to
predict in advance. Furthermore, with arithmetic coding, the code space
allotted to the cache indexes can be scaled down to accommodate any
desired probability for the new-word marker.

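APPENDIX. Proof of Decoding Inequality

Using one-letter abbreviations for cum_freq, symbol, low, high, and
value, suppose

$$c[s] \le \left\lfloor \frac{(v - l + 1) \times c[0] - 1}{h - l + 1} \right\rfloor < c[s - 1];$$

in other words,

$$c[s] \le \frac{(v - l + 1) \times c[0] - 1}{r} - \varepsilon \le c[s - 1] - 1, \qquad (1)$$

where

$$r = h - l + 1, \qquad 0 \le \varepsilon \le \frac{r - 1}{r}.$$

(The last inequality of Eq. (1) derives from the fact that c[s - 1]
must be an integer.) Then we wish to show that $l' \le v \le h'$, where
$l'$ and $h'$ are the updated values for low and high as defined below.

(a)
$$l' = l + \left\lfloor \frac{r \times c[s]}{c[0]} \right\rfloor \le l + \frac{r}{c[0]} \left[ \frac{(v - l + 1) \times c[0] - 1}{r} - \varepsilon \right] \quad \text{from Eq. (1)},$$

$$\le v + 1 - \frac{1}{c[0]},$$

so $l' \le v$ since both $v$ and $l'$ are integers and $c[0] > 0$.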



(b) h' 



L c[0) 



-1 



*'^[ *-'*r' ol " +H- 

7kH"~}-°- 



from Eq. (l), 



> u + 
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